0.  ABSTRACT

We assume a parallel RAM model which allows both concurrent writes and
concurrent reads of global memory. Our algorithms are randomized: each
processor is allowed an independent random number generator. However, our
stated resource bounds hold for worst case input with overwhelming
likelihood as the input size grows.

We give a new parallel algorithm for integer sorting where the integer keys
are restricted to at most polynomial magnitude. Our algorithm costs only
logarithmic time and is the first known where the product of the time and
processor bounds is bounded by a linear function of the input size. These
simultaneous resource bounds are asymptotically optimal. All previously
known parallel sorting algorithms required at least a linear number of
processors to achieve logarithmic time bounds, and hence were nonoptimal by
at least a logarithmic factor.


A  large  literature  exists  on  efficient  sequential 
RAM  algorithms  with  time  bound  linear  in  the  input 
size.  Many  of  these  algorithms  require  sorts  to  be  done 
on  integers  of  at  most  polynomial  magnitude.  For 
example,  the  depth  first  search  algorithms  of  [Tarjan, 
72]  and  [Hopcroft  and  Tarjan,  73]  require  the  edges 
(which  may  be  considered  integers)  to  be  sorted  into 
adjacency lists. An $\Omega(n \log n)$ comparison sort such
as  QUICK-SORT  or  HEAP-SORT  would  not  be 
sufficiently  efficient  for  these  applications.  Instead,  the 
BUCKET-SORT  (see  [Aho,  Hopcroft,  and  Ullman,  74]) 
is  used  to  sort  in  linear  time.  The  BUCKET-SORT 
algorithm  is  sufficiently  simple  and  elegant  so  that  it  is 
widely  used  in  practice. 

The goal of this paper is to develop an efficient and possibly practical
integer sorting algorithm for a parallel RAM model, but we will utilize
quite different techniques, such as randomization.


John  H.  Reif* 

Aiken  Computation  Lab. 
Harvard  University 
Cambridge,  Massachusetts 


March,  1985 


0.  ABSTRACT 

We give new parallel algorithms for integer sorting and undirected graph
connectivity problems such as connected components and spanning forest. Our
algorithms cost only logarithmic time and are the first known that are optimal:
the product of their time and processor bounds is bounded by a linear function
of the input size. All previously known parallel algorithms for these problems
required at least a linear number of processors to achieve logarithmic time
bounds, and hence were nonoptimal by at least a logarithmic factor.

We assume a parallel RAM model which allows both concurrent writes and concurrent
reads of global memory. Our algorithms are randomized: each processor is allowed an
independent random number generator; however, our stated resource bounds hold for
worst case input with overwhelming likelihood as the input size grows.

* This work was supported by the Office of Naval Research Contract N00014-80-C-0647.


1.  INTRODUCTION

1.1  Optimal Sequential RAM Algorithms

A  large  literature  exists  on  efficient  sequential  algorithms  with  time  bound 
linear  in  the  input  size.  This  literature  generally  assumes  the  sequential  Random 
Access Machine Model (RAM); for an introduction to this literature see [Aho, Hopcroft,
and Ullman, 74]. Perhaps the most influential works done in this area were the graph
algorithms of [Tarjan, 72] and [Hopcroft and Tarjan, 73]. These efficient sequential
algorithms  relied  on  linear  time  algorithms  for  (1)  bucket  sort,  and  (2)  depth 
first  search. 

This  linear  time  bucket  sort  was  essential  to  depth  first  search  since  the  edges 
must  be  sorted  into  adjacency  lists.  By  ingenious  use  of  both  (1)  and  (2),  Hopcroft 
and Tarjan derived linear time algorithms for graph problems such as connected
components, spanning forest, and biconnected components.

The goal of this paper is to achieve similar results (i.e., optimal algorithms)
for  a  parallel  RAM  model,  but  we  will  utilize  quite  different  techniques  (i.e., 
randomization). 

1.2  Known Parallel RAM Algorithms

The performance of a parallel algorithm can be specified by bounds on its
principal resources: processors and time. We generally let P denote the processor
bound and T denote the time bound. For most nontrivial problems such as sorting
and the above graph problems, the product $P \cdot T$ is lower bounded by at least a
constant times the input size. Thus for these problems, we consider a parallel
algorithm to be optimal if $P \cdot T = O(\text{input size})$. For example, given a graph of
n vertices and m edges, a parallel graph connectivity algorithm is optimal if
$P \cdot T = O(n+m)$. Of course, if we have an optimal algorithm with any processor
bound P, then we also have (by the obvious processor simulation) an optimal
algorithm for any processor bound P', where $P \ge P' \ge 1$. Hence an optimal algorithm
may also be useful in practical situations where we have a limited number of processors.
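
The "obvious processor simulation" mentioned above can be spelled out in one line. This is a sketch under the assumption that each of the $P'$ processors simulates at most $\lceil P/P' \rceil$ of the original processors in round-robin fashion; the symbol $T'$ for the resulting time bound is introduced here only for illustration.

```latex
T' \;=\; \left\lceil \frac{P}{P'} \right\rceil T
   \;\le\; \left( \frac{P}{P'} + 1 \right) T
   \;\le\; \frac{2P}{P'}\, T
\quad\Longrightarrow\quad
P' \cdot T' \;\le\; 2\, P \cdot T \;=\; O(\text{input size}).
```

The last inequality in the chain uses $P/P' \ge 1$, so slowing down by a factor of $\lceil P/P' \rceil$ costs at most a constant factor in the product.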




We assume the parallel RAM model of [Shiloach and Vishkin, 81]. The processors
are synchronous, and each is a unit cost sequential RAM which in a single step may
either read or write into a memory cell or register, or perform an arithmetic
operation on an integer. Each memory cell and register may contain at most a
logarithmic number of bits in the input size. This parallel RAM model allows
multiple reads at a single memory cell and also allows multiple writes at a single
memory cell, where multiple writes are allowed to be resolved arbitrarily. This
model is known as the CRCW parallel RAM and is quite robust; see [Kucera, 82] for
its relation to other parallel machine models. In addition we allow each processor
an independent random number generator.

There are a number of known algorithms for sorting in logarithmic time using a
linear number of processors; for example [Reischuk, 82] gives a randomized parallel
RAM algorithm (which unfortunately requires memory cells of $n^{1/2}$ bits each). [Reif
and Valiant, 83] give a randomized parallel algorithm (which has only moderate constant
bounds and requires memory cells of O(log n) bits each), and [Ajtai, Komlós, and
Szemerédi, 83; and Leighton, 84] give a deterministic parallel algorithm. This last
result of [Leighton, 84] appeared to finally settle the problem of parallel sorting
since $PT = \Omega(n \log n)$ is a known lower bound in the case of comparison sorting.
However, these lower bounds on PT need not hold for integer sorting: sorting n integers
in the range [n]* (note that the restriction to the range [n] is natural, since RAM
memory cells can only contain numbers with at most a logarithmic number of bits).
Integer sorting is all that is required for most practical applications of interest,
for example for putting a list of edges into adjacency list representation by sorting
the edges by the vertices from which they depart. On the other hand, an optimal
integer sort is essential in the derivation of any optimal parallel graph algorithm
which requires the edges to be put in adjacency list representation.


* Note: throughout this paper, we let [n] denote $\{1, \ldots, n\}$.

Previously T = O(log n) time bounds and simultaneous P = n+m processor bounds
have been given for connected components [Shiloach and Vishkin, 83] and spanning trees
[Averbuch and Shiloach, 83] of graphs with n vertices and m edges. All these
previous algorithms had a $PT = (n+m)\log n$ bound, which was a logarithmic factor
more resources than optimal for logarithmic time bounds. [Tarjan and Vishkin, 83]
pose as an open problem to find optimal parallel graph algorithms.

In fact no optimal graph searching method has been proposed for parallel RAM,
for any sublinear time bounds, except in the special case where the graph is extremely
dense (i.e., $m = \Omega(n^2)$). [Chin, Lam and Chen, 82] and [Vishkin, 81]
both give $O(\log^2 n)$ time connectivity algorithms requiring $(n^2+m)/\log^2 n$
processors, which is optimal only if $m = \Omega(n^2)$.

Vishkin conjectured that randomized techniques would be needed to get optimal
parallel graph connectivity algorithms. Indeed the literature contains some interesting
attempts to use randomization to derive optimal parallel algorithms for graph problems.
For example [Vishkin, 84] recently gave a randomized algorithm for finding the number
of successors on a linear list which used an optimal number of processors with an almost
logarithmic time bound. (However, Vishkin's algorithm assumed an oracle which provided
a random permutation, but he provided no efficient method for parallel construction of
random permutations.) Also [Reif, 84] gave a randomized parallel graph algorithm which
had optimal processor bounds only for graphs with $m \ge n(\log n)^2$ edges.

1.3  Our Optimal Parallel RAM Algorithms

Our  main  results  are  optimal  randomized  parallel  RAM  algorithms: 

(1)  $\tilde{O}(\log n)$ time, n/log n processor algorithms for integer sorting

(2)  $\tilde{O}(\log n)$ time, (m+n)/log n processor algorithms for connected components
and spanning forests for any graph of n vertices and m edges.

Here $\tilde{O}$ denotes that the upper bound holds within a constant factor with
overwhelming likelihood, for the worst case input. In particular, we let
$T(n) = \tilde{O}(f(n))$ denote $\exists c\ \forall \alpha \ge 1$, $\forall$ sufficiently large n,
$T(n) \le c\,\alpha f(n)$ holds with probability at least $1 - 1/n^{\alpha}$.

Our integer sorting algorithm is quite easy to implement and may be of some
practical  use,  since  it  has  very  moderate  constant  factors. 

1.4  Organization of This Paper

In  Section  2,  we  give  a  known  optimal  algorithm  for  parallel  prefix  computation 
which  will  be  of  some  use  in  devising  our  optimal  parallel  algorithms. 

In  Section  3,  we  give  our  optimal  parallel  algorithm  for  integer  sorting,  which 
achieves  its  efficiency  by  some  interesting  new  randomization  techniques.  As  an 
immediate  consequence,  (see  Appendix  A3)  we  get  an  optimal  parallel  algorithm  for 
computing  a  random  permutation. 

In Section 4 (and Appendix A4) we give our algorithm for graph connectivity. It
is derived in stages where we consider graphs of decreasing density. We first give
a simple logarithmic time algorithm called RANDOM-MATE, which is nonoptimal, but
utilizes randomization in an essential and new way. We next modify this algorithm
so that it is optimal for graphs of n vertices with at least $m \ge n(\log n)^2$ edges.
Then we give efficient parallel reductions from various cases of sparse graphs
to the case $m \ge n(\log n)^2$.

In Appendix A1 we give some useful upper bounds for the tails of various
probability  distributions  which  arise  in  the  analysis  of  our  algorithms. 

In a separate paper we give applications of our optimal parallel graph
connectivity algorithm to finding Euler cycles, biconnected components, and minimum
spanning trees.

2.  PARALLEL  PREFIX  COMPUTATION 

2.1  Prefix Circuits

Let D be a domain and let $\circ$ be an associative operation which takes O(1)
sequential time over this domain. The prefix computation problem is defined as follows:

input  $X(1), \ldots, X(n) \in D$

output  $X(1),\ X(1) \circ X(2),\ \ldots,\ X(1) \circ \cdots \circ X(n)$.

[Ladner and Fischer, 80] show prefix computation can be done by a circuit of
size n and depth O(log n).

Known techniques attributed to Brent give the following processor improvement:

LEMMA 2.1.  Prefix computation can be done in time O(log n) using n/log n P-RAM
processors.

The prefix sum computation problem is defined as follows: Given input integers
$X(1), \ldots, X(n) \in [n]$, output the vector PREFIX-SUM(X) = (Y(0), Y(1), ..., Y(n)) where
Y(0) = 0 and $Y(i) = \sum_{j=1}^{i} X(j)$ for $i \in [n]$. By Lemma 2.1, we can do this computation
in time O(log n) using n/log n processors.
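
The processor-saving idea behind Lemma 2.1 can be illustrated with a sequential simulation. The sketch below is only an assumption about how the blocking might be arranged (blocks of about log n inputs, one simulated processor per block); the function name `prefix_sum_blocked` is hypothetical, and Phase 2 is written as a plain loop where a parallel machine would use a prefix circuit on the n/log n block totals.

```python
import math

def prefix_sum_blocked(xs):
    """Return [Y(0), ..., Y(n)] with Y(0) = 0 and Y(i) = xs[0] + ... + xs[i-1]."""
    n = len(xs)
    if n == 0:
        return [0]
    block = max(1, int(math.log2(n)))      # block length of about log n
    starts = list(range(0, n, block))      # one simulated processor per block
    # Phase 1: each processor sums its own block sequentially.
    block_sums = [sum(xs[s:s + block]) for s in starts]
    # Phase 2: prefix sums over the block totals (a prefix circuit in parallel).
    offsets = [0]
    for b in block_sums:
        offsets.append(offsets[-1] + b)
    # Phase 3: each processor rescans its block, starting from its offset.
    ys = [0] * (n + 1)
    for p, s in enumerate(starts):
        running = offsets[p]
        for i in range(s, min(s + block, n)):
            running += xs[i]
            ys[i + 1] = running
    return ys
```

Phases 1 and 3 each take time proportional to the block length, which is why about n/log n processors suffice for an O(log n) time bound in this arrangement.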

3.  AN  OPTIMAL  PARALLEL  SORTING  ALGORITHM 

3.1  Known Sorting Algorithms

The integer sorting problem of size n is defined:

input  keys $k_1, \ldots, k_n \in [n]$

output  permutation $\sigma = (\sigma(1), \ldots, \sigma(n))$ such that $k_{\sigma(1)} \le \cdots \le k_{\sigma(n)}$.

The input keys $k_1, \ldots, k_n$ are not necessarily distinct. By use of the well known
and quite practical BUCKET-SORT algorithm [Aho, Hopcroft, and Ullman, 74],

LEMMA 3.1.  Integer sorting can be done in time O(n) by a deterministic sequential RAM.
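
As a concrete reading of Lemma 3.1, the following minimal sketch returns the output permutation defined above rather than the sorted keys. The function name and the 1-based index convention are choices made here for illustration, not taken from [Aho, Hopcroft, and Ullman, 74].

```python
def bucket_sort_permutation(keys, n_values):
    """keys[i-1] is k_i in 1..n_values; return sigma as a list of 1-based indices
    such that the keys appear in nondecreasing order (ties keep input order)."""
    buckets = [[] for _ in range(n_values + 1)]   # bucket 0 is unused
    for i, k in enumerate(keys, start=1):
        buckets[k].append(i)                      # O(1) work per key
    return [i for bucket in buckets for i in bucket]
```

For example, the keys (3, 1, 2, 3, 1) yield the index order (2, 5, 3, 1, 4).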

Any comparison based sort requires $PT = \Omega(n \log n)$, and the best known parallel
sorts actually achieve these bounds. In particular, [Reif and Valiant, 83] show

LEMMA 3.2.  n keys can be sorted in time $\tilde{O}(\log n)$ using n processors in a constant
degree network.

This algorithm uses memory cells of O(log n) bits. It can also be implemented
by the randomized P-RAM model. In addition, [Ajtai, Komlós, and Szemerédi, 83] and
[Leighton, 84] give a deterministic sorting network which takes
O(log n) time with O(n) processors. In the following, we prove:

THEOREM 3.1.  Integer sort can be done in time $\tilde{O}(\log n)$ using n/log n P-RAM
processors.


We will achieve PT = O(n) for integer sorting, making essential use of the
fact that the input keys $k_1, \ldots, k_n$ are integers in [n] as in the case of all our
graph applications. We would be quite surprised if any purely deterministic methods
yield PT = O(n) for parallel integer sort in the case of time bounds T = O(log n).
Although we will use deterministic methods to solve some restricted integer sorting
problems (see Lemmas 3.4 and 3.5 below), our optimal parallel algorithm for the
general integer sorting problem requires some interesting, new use of randomization
techniques (see Lemmas 3.6 and 3.7).

3.2  Easy Integer Sorting Problems

Given a sequence of keys $k_1, \ldots, k_n \in [n]$ let the key index sets be
$I(k) = \{i \mid k_i = k\}$ for each key value $k \in [n]$. We will assume log n divides n.

LEMMA 3.3.  Given I(1), ..., I(n), we can sort $k_1, \ldots, k_n$ in O(log n) time using
P = n/log n processors.

Proof.  See  Appendix  A3. 
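
The idea of Lemma 3.3 (prefix sums over the index set sizes give each set its output offset) can be sketched sequentially as follows. The name `permutation_from_index_sets` and the list-of-lists input format are assumptions made for this illustration; on a parallel machine the inner loops would be distributed over the P processors.

```python
from itertools import accumulate

def permutation_from_index_sets(index_sets):
    """index_sets[k-1] is I(k) as a list of indices; return sigma listing the
    indices of I(1), then I(2), and so on."""
    sizes = [len(s) for s in index_sets]
    offsets = [0] + list(accumulate(sizes))   # PREFIX-SUM of |I(1)|, ..., |I(n)|
    sigma = [0] * offsets[-1]
    for k, idxs in enumerate(index_sets):
        for r, i in enumerate(idxs):
            sigma[offsets[k] + r] = i         # I(k) fills a contiguous block
    return sigma
```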

A sorting algorithm is stable if given $k_1, \ldots, k_n$, the algorithm outputs a
permutation $\sigma$ of (1, ..., n) where $\forall i, j \in [n]$ if $k_i = k_j$ and $i < j$ then
$\sigma(i) < \sigma(j)$.

LEMMA 3.4.  A stable sort of n keys $k_1, \ldots, k_n \in [\log n]$ can be computed in O(log n)
time using P = n/log n processors.

Proof .  See  Appendix  A3. 

LEMMA 3.5.  n keys $k_1, \ldots, k_n \in [(\log n)^2]$ can be sorted in O(log n) time using
P = n/log n processors.

Proof.  See  Appendix  A3. 

Note: We can similarly extend Lemma 3.5 to apply to key values in $[(\log n)^{O(1)}]$.
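
One way to see why a stable sort on a small range extends to the square of that range is the standard two-pass, least-significant-digit-first radix sort. The sketch below uses that standard arrangement, which may differ in pass order from the proof in Appendix A3; the names `stable_sort_by` and `sort_two_digits` are hypothetical, and `base` plays the role of log n.

```python
def stable_sort_by(indices, digit, base):
    """Stable counting sort of `indices` by digit(i), a value in 0..base-1."""
    buckets = [[] for _ in range(base)]
    for i in indices:
        buckets[digit(i)].append(i)
    return [i for b in buckets for i in b]

def sort_two_digits(keys, base):
    """Keys lie in 1..base*base; return 0-based indices in sorted key order."""
    low = lambda i: (keys[i] - 1) % base      # low-order digit
    high = lambda i: (keys[i] - 1) // base    # high-order digit
    order = stable_sort_by(range(len(keys)), low, base)
    return stable_sort_by(order, high, base)  # stability preserves low-digit order
```

Repeating the pass a constant number of times handles any key range that is a fixed power of the base, which is the extension described in the note above.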

3.3  Randomized Sampling and Sorting in Key Domain $[n/(\log n)^2]$

In the following subsection, we fix a key domain [D] where $D = n/(\log n)^2$.
(We assume $(\log n)^2$ divides n.) Let the input keys be $k_1, \ldots, k_n \in [D]$ and
their index sets be $I(k) = \{i \mid k_i = k\}$ for each key value $k \in [D]$.


LEMMA 3.6.  Given as input $k_1, \ldots, k_n \in [D]$, we can compute N(1), ..., N(D) in
$\tilde{O}(\log n)$ time using P = n/log n processors, such that $\sum_{k=1}^{D} N(k) \le O(n)$ and
furthermore with high likelihood (in fact with probability $\ge 1 - 1/n^{\alpha}$ for any given
$\alpha \ge 1$) $N(k) \ge |I(k)|$ for each $k \in [D]$.

As proof, we execute the following randomized sampling algorithm.

Step 1  for each processor $\pi \in [P]$ in parallel do
            choose a random $s_\pi \in [n]$ od
        $S \leftarrow \{s_1, \ldots, s_P\}$

Comment.  Here we randomly choose a set $S \subseteq [n]$ of P key indices.

Step 2  Sort $k_{s_1}, \ldots, k_{s_P}$ and compute index set $I_S(k) = \{i \in S \mid k_i = k\}$
        for each key value $k \in [D]$.

Comment.  Applying Lemma 3.2, this sorting can be done by known parallel
algorithms in $\tilde{O}(\log n)$ time using P processors.

Step 3  for each $k \in [D]$ do
            $N(k) \leftarrow d_0 (\log n)(|I_S(k)| + \log n)$
        od

Comment.  $d_0$ is a constant to be determined in the probabilistic analysis.

output  N(1), ..., N(D).

See  Appendix  A3  for  a  proof  of  the  probabilistic  bounds  given  in  Lemma  3.6. 
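
The sampling algorithm above can be simulated sequentially as follows. This is a sketch under stated assumptions: the default `d0=4` is an arbitrary placeholder (the source leaves the constant to the probabilistic analysis), logarithms are taken base 2 and rounded down, and the sorting of Step 2 is replaced by a simple counter, since only the sizes of the sampled index sets are needed here.

```python
import math
import random
from collections import Counter

def estimate_bucket_sizes(keys, D, d0=4, rng=None):
    """keys[i] in 1..D; return {k: N(k)} with N(k) = d0 * log n * (|I_S(k)| + log n)."""
    rng = rng or random.Random()
    n = len(keys)
    log_n = max(1, int(math.log2(n)))
    P = max(1, n // log_n)
    sample = {rng.randrange(n) for _ in range(P)}     # Step 1: the set S of sampled indices
    counts = Counter(keys[i] for i in sample)         # Step 2: |I_S(k)| for each k
    return {k: d0 * log_n * (counts.get(k, 0) + log_n)  # Step 3
            for k in range(1, D + 1)}
```

Summing over k gives at most d0 (log n)(P + D log n), which is O(n) when P = n/log n and D = n/(log n)^2, whatever the random choices are.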

LEMMA 3.7.  n keys $k_1, \ldots, k_n \in [D]$ (where $D = n/(\log n)^2$) can be sorted in
$\tilde{O}(\log n)$ time using P = n/log n processors.

Proof.  (We will actually use O(P) processors, but we observe that we can then slow
the computations down by a constant factor to reduce the processor bound to P.) Our
randomized algorithm is given below.

Step 1  Compute N(1), ..., N(D) as defined in Lemma 3.6.

Comment.  Here we use the random sampling algorithm of Lemma 3.6.

Step 2  $(\bar{N}(0), \ldots, \bar{N}(D)) \leftarrow$ PREFIX-SUM(N(1), ..., N(D))

Comment.  This prefix-sum computation is done by Lemma 2.1 in O(log n) time
and O(P) processors.

Step 3  for each key value $k \in [D]$
        do $P_k \leftarrow \{\pi \mid \pi \in [D] \text{ or } \bar{N}(k-1) + D < \pi \le \bar{N}(k) + D\}$. Using these $P_k$
           processors, construct a table $A_k = (A_k(1), A_k(2), \ldots, A_k(N(k)), A_k(N(k)+1))$
           and initialize each element of the table to be an empty list.
        od


Step 4  for each $\pi \in [P]$ in parallel do
            for each t = 1, ..., log n sequentially do
                $i_\pi \leftarrow (\pi - 1) \log n + t$
                choose a random number $r_{i_\pi} \in [N(k)]$, where $k = k_{i_\pi}$
                attempt to add $i_\pi$ to front of list $A_{k_{i_\pi}}(r_{i_\pi})$
                if successful (i.e., $i_\pi$ is now in front of list $A_{k_{i_\pi}}(r_{i_\pi})$)
                    then CONFLICT($i_\pi$) $\leftarrow$ 0 else CONFLICT($i_\pi$) $\leftarrow$ 1 fi
        od od


Comment.  Each processor $\pi \in [P]$ is responsible for keys $k_{(\pi-1)\log n + 1}, \ldots, k_{\pi \log n}$.
The inner loop for t = 1, ..., log n is executed sequentially so as to minimize
conflicts. In the t-th iteration of the inner loop, processor $\pi$ attempts to add
the index $i_\pi = (\pi - 1)\log n + t$ of the key $k_{i_\pi}$ to the front of list $A_{k_{i_\pi}}(r_{i_\pi})$ where
$r_{i_\pi}$ is a randomly chosen integer in [N(k)] for $k = k_{i_\pi}$. This may not be successful if
some other processor $\pi'$ simultaneously attempts to add some other index $i_{\pi'}$ to the
front of list $A_{k_{i_\pi}}(r_{i_\pi})$. Only one addition to this list will succeed. But this conflict
will only happen in the case $k_{i_{\pi'}} = k_{i_\pi}$ and $\pi'$ makes the same unlucky choice of $r_{i_{\pi'}} = r_{i_\pi}$.

Claim 3.1.  Let $n' = \sum_{i=1}^{n} \mathrm{CONFLICT}(i)$. Then $n' \le \tilde{O}(P)$. In particular, $\exists c\ \forall \alpha \ge 1$

$\mathrm{Prob}(n' \le \alpha\, c\, n/\log n) \ge 1 - 1/n^{\alpha}$.


Proof.  See  Appendix  A3. 
Step 5  (u(0), ..., u(n)) $\leftarrow$ PREFIX-SUM(CONFLICT(1), ..., CONFLICT(n))
        $n' \leftarrow u(n)$
        for each $\pi \in [P]$ in parallel
            do for each t = 1, ..., log n sequentially
                do $i_\pi \leftarrow (\pi - 1)\log n + t$
                   if CONFLICT($i_\pi$) then $j_{u(i_\pi)} \leftarrow i_\pi$ fi
            od od

Comment.  $(j_1, \ldots, j_{n'})$ is the list of indices j such that CONFLICT(j) = 1. Again,
the prefix computations can be done by applying Lemma 2.1.


Step 6  Sort $k_{j_1}, \ldots, k_{j_{n'}}$ and for each key value $k \in [D]$ assign
        $A_k(N(k)+1) \leftarrow \{j_i \mid k = k_{j_i}\}$.

Comment.  Assuming $n' \le O(P)$, this step can be done by known parallel sorting
algorithms in time $\tilde{O}(\log n)$ using P processors.


Step 7  for each key value $k \in [D]$
        do Construct table $A'_k$ consisting of a list of all the elements of
           the lists $A_k(1), A_k(2), \ldots, A_k(N(k)), A_k(N(k)+1)$.
        od

Comment.  This is done in O(log n) time by careful use of the processor set $P_k$. In
particular, we first compute
$(a_k(0), \ldots, a_k(N(k)+1)) \leftarrow$ PREFIX-SUM$(|A_k(1)|, |A_k(2)|, \ldots, |A_k(N(k))|, |A_k(N(k)+1)|)$.
Note that $|A_k(i)| \le \log n$ for each i. Hence for each i = 1, ..., N(k)+1 in parallel we
can place the elements of $A_k(i)$ into locations $A'_k(a_k(i-1)+1), \ldots, A'_k(a_k(i))$ using a
single processor $\pi \in P_k$ with time O(log n).

Step 8  Compute a permutation $\sigma$ of (1, ..., n) such that the elements of $A'_1, \ldots, A'_D$
        appear in order.

Comment.  We apply here Lemma 3.3.

output  $\sigma = (\sigma(1), \ldots, \sigma(n))$.

The total time for steps 1-8 is $\tilde{O}(\log n)$ using P processors.
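
The random placement of Step 4 and the conflict flags it produces can be simulated round by round. In this sketch, which is an illustration and not the paper's implementation, concurrent writes to the same list cell within a round are resolved by letting the first simulated writer win (the model allows any resolution), the tables are stored in a dictionary keyed by (k, r), and logarithms are base 2 and rounded down.

```python
import math
import random
from collections import defaultdict

def place_with_conflicts(keys, N, rng):
    """keys[i-1] is k_i; N[k] is the table length for key value k.
    Returns (tables, conflict) where tables[(k, r)] is the list A_k(r) and
    conflict[i] is 1 if index i lost a concurrent write, else 0."""
    n = len(keys)
    log_n = max(1, int(math.log2(n)))
    P = math.ceil(n / log_n)
    tables = defaultdict(list)
    conflict = [0] * (n + 1)                  # 1-based; conflict[0] is unused
    for t in range(1, log_n + 1):             # inner loop, run sequentially
        attempts = defaultdict(list)
        for pi in range(1, P + 1):            # all processors act "in parallel"
            i = (pi - 1) * log_n + t
            if i > n:
                continue
            k = keys[i - 1]
            attempts[(k, rng.randint(1, N[k]))].append(i)
        for cell, writers in attempts.items():
            tables[cell].insert(0, writers[0])   # exactly one write succeeds
            for loser in writers[1:]:
                conflict[loser] = 1              # handled later by Steps 5 and 6
    return tables, conflict
```

Every index ends up either in exactly one list or flagged as a conflict, and no list grows beyond log n entries because each round adds at most one element per cell.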


Finally, we prove Theorem 3.1 by combining the above techniques. (We again
assume $(\log n)^2$ divides n.)

Input  keys $k_1, \ldots, k_n \in [n]$

Step 1  Assign $k'_i = \lfloor k_i/(\log n)^2 \rfloor + 1$ and $k''_i = k_i - (k'_i - 1)(\log n)^2 + 1$ for each $i \in [n]$

Comment.  $k'_1, \ldots, k'_n \in [D]$ where $D = n/(\log n)^2$ and $k''_1, \ldots, k''_n \in [(\log n)^2]$

Step 2  Sort $k'_1, \ldots, k'_n \in [D]$ resulting in index sets $I'(k) = \{i \mid k'_i = k\}$ for each
        key value $k \in [D]$

Comment.  This is done by applying Lemma 3.7.

Step 3  Sort $\{k''_i \mid i \in I'(k)\} \subseteq [(\log n)^2]$ yielding ordered list L(k) of indices
        in I'(k) for each key value $k \in [D]$

Comment.  This is done by applying the stable sort of Lemma 3.5 to the ordered list
of keys I'(1) ... I'(D).

Step 4  Compute the permutation $\sigma$ which orders the indices as L(1), ..., L(D)

Comment.  Here we apply Lemma 3.3; $\sigma$ satisfies $k_{\sigma(1)} \le \cdots \le k_{\sigma(n)}$.

output  $\sigma$

The Lemmas 3.2-3.7 and the appropriate use of prefix-sum computation (Lemma 2.1) imply
that each step can be done in $\tilde{O}(\log n)$ using P = n/log n processors.  □
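
The key decomposition of Step 1 can be checked on small numbers. The sketch below assumes the bracket in Step 1 is a floor, which is the reading under which k'' falls in the stated range; the function name `split_key` and the parameter `L2` (standing for (log n)^2) are introduced here for illustration.

```python
def split_key(k, L2):
    """Split key k >= 1 into (k', k'') with k = (k' - 1) * L2 + (k'' - 1)."""
    k_hi = k // L2 + 1                   # k'  = floor(k / (log n)^2) + 1
    k_lo = k - (k_hi - 1) * L2 + 1       # k'' = k - (k' - 1)(log n)^2 + 1
    return k_hi, k_lo
```

With L2 = 16, the key 37 splits into (3, 6), and comparing pairs lexicographically agrees with comparing the original keys. Under this reading the single key k = n lands in k' = D + 1, so an implementation would need one extra bucket or 0-based keys.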

3.5  Optimal Parallel Generation of a Random Permutation

COROLLARY 3.1.  A random permutation $\sigma$ of (1, ..., n) can be constructed in
$\tilde{O}(\log n)$ time using P = n/log n P-RAM processors.

Proof .  See  Appendix  A3. 

4.  OPTIMAL PARALLEL GRAPH ALGORITHMS

Given  a  graph  G,  let  CC(G)  be  the  connected  components  of  G.  We  prove  in 
this  section: 

THEOREM  4.1.  For  any  graph  G  with  n  vertices  and  m  edges  we  can  compute  CC(G) 
in  O(log  n)  time  using  (m+n)/log  n  parallel  RAM  processors. 

(Note:  Simple  modifications  of  our  algorithms  also  give  a  spanning  forest  of  G 
within  the  same  resource  bounds.) 


The proof of Theorem 4.1 will be separated into three cases of decreasing density
of edges. In each case, we efficiently reduce the connected components problem to one
for a denser graph. The density reductions use various randomized sampling techniques
(see details in Appendix A4).

4.1  A New, But Nonoptimal Randomized Algorithm

We  begin  by  describing  a  new  randomized  algorithm  RANDOM-MATE  for  computing  CC(G) 
of  G  =  (V, E)  with  n  vertices  V=  {l,...,n}  and  m  edges  E.  We  will  associate  a 
distinct  processor  with  each  vertex  of  V  and  each  edge  of  E.  This  algorithm  will 
be  nonoptimal  since  it  runs  in  O(log  n)  time  using  n+m  processors  as  did  previous 
parallel  graph  connectivity  algorithms  [Shiloach  and  Vishkin,  83].  However,  RANDOM-MATE 
has  the  advantage  (not  shared  by  the  previous  deterministic  algorithms)  that  it  can  be 
modified  to  an  optimal  algorithm,  as  we  prove  in  the  Appendix  A4. 

Our randomized connectivity algorithm will be motivated by the following

LEMMA 4.1.  (The Random Mating Lemma) Let G = (V,E) be any graph. Suppose for each
vertex $v \in V$ we randomly, independently assign SEX(v) $\in$ {male, female}. Let vertex
v be active if there exists at least one departing edge $(v,u) \in E$ where $u \ne v$, and
let vertex v be mated if SEX(v) = male and SEX(u) = female for at least one edge
$\{v,u\} \in E$. Then with probability 1/2 the number of mated vertices is at least 1/8
of all active vertices.

Proof.  See  Appendix  A4. 

To represent collapsed subgraphs, we use an array R which we view as pointers
mapping $V \to V$. Let the graph collapsed by R be defined R(G) = (R(V), R(E)) where
$R(V) = \{R(v) \mid v \in V\}$ and $R(E) = \{(R(v), R(u)) \mid \{v,u\} \in E,\ R(v) \ne R(u)\}$. Each vertex
$r \in R(V)$ is named an R-root. Our algorithm below (and the ones to follow) will always
satisfy R(R(v)) = R(v) for each $v \in V$. Hence the R pointers define a directed forest
$(V, \{(v, R(v)) \mid v \in V - R(V)\})$. Each tree in this forest will be called an R-tree; it
will have height $\le 1$ and will consist of a maximal set of vertices of V mapped
to the same R-root.

Initially we set R(v) = v for all $v \in V$. We will prove that at the end of the
algorithm the vertices of R-trees are the connected components CC(G).

We execute the main loop c log n times, where c is a constant defined in the
proof below. On each execution of the main loop, we merge together connected subgraphs
by randomly assigning R-roots male or female with equal probability, and then letting
each R-root assigned male be merged into an R-root assigned female, if there is an
edge between those corresponding subgraphs. Note that we can view this as a mating
process where each male may be mated and merged into at most one female but many males
may merge into the same female.

It will be useful to define $D(E) = \{(v,u) \mid \{v,u\} \in E\} \cup \{(u,v) \mid \{v,u\} \in E\}$ to be the
directed edges derived from E.

algorithm RANDOM-MATE

input  graph G = (V,E) with n = |V| and m = |E|.
initialize:  for each $v \in V$ in parallel do $R(v) \leftarrow v$ od
main loop:  for t = 1, ..., c log n
do
    assign sex:  for each $v \in V$ in parallel do
        if R(v) = v then
            comment v is currently an R-root
            randomly assign SEX(v) $\in$ {male, female}
        fi od

    merge:  for each $(v,u) \in D(E)$ in parallel do MATE(v,u) od
    collapse:  for each $v \in V$ in parallel
        do $R(v) \leftarrow R(R(v))$
            comment collapse the R-trees to depth $\le 1$
        od
od

output  R(1), ..., R(n)

Also we define

procedure MATE(v,u)
    if SEX(R(v)) = male and SEX(R(u)) = female
        then $R(R(v)) \leftarrow R(u)$ fi
    comment attempt to mate male R-root R(v) with female R-root R(u).
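
RANDOM-MATE can be simulated sequentially to see the merge and collapse steps in action. The sketch below is an illustration under stated assumptions: instead of running a fixed c log n rounds it stops once no edge joins two different R-roots (with a hypothetical `max_rounds` safety cap), concurrent writes to the same root are resolved by letting the last simulated writer win, and vertices are numbered 1..n.

```python
import random

def random_mate(n, edges, rng, max_rounds=10_000):
    """Vertices are 1..n and edges is a list of pairs (v, u).
    Returns the pointer array R as a dict mapping each vertex to its R-root."""
    R = {v: v for v in range(1, n + 1)}                     # initialize
    directed = [(v, u) for v, u in edges] + [(u, v) for v, u in edges]   # D(E)
    for _ in range(max_rounds):
        if all(R[v] == R[u] for v, u in directed):
            break                                           # no active R-root remains
        # assign sex: only current R-roots get a sex
        sex = {v: rng.choice(("male", "female")) for v in R if R[v] == v}
        snapshot = dict(R)                                  # synchronous reads
        # merge: MATE(v, u) for every directed edge
        for v, u in directed:
            rv, ru = snapshot[v], snapshot[u]
            if sex[rv] == "male" and sex[ru] == "female":
                R[rv] = ru                                  # male root points to female root
        # collapse: restore R-trees of depth at most 1
        R = {v: R[R[v]] for v in R}
    return R
```

Only male roots are ever redirected and only to female roots, which do not move in the same round, so a single collapse step is enough to restore the invariant R(R(v)) = R(v).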

Claim 4.1.  The vertex set of each R-tree is always within a single connected
component of CC(G).

Proof.  See  Appendix  A4. 

Note RANDOM-MATE may have incorrect output if after c log n iterations, there
still exists an active R-root. But the main body can easily be altered to test if
$\exists \{v,u\} \in E$ such that $R(v) \ne R(u)$ and if so, go back to the main loop.

RANDOM-MATE  then  yields  the  following  (nonoptimal)  result: 

LEMMA 4.2.  For any graph G with n vertices and m edges, we can compute CC(G)
in time O(log n) using m+n processors.

Proof.  See  Appendix  A4. 

4.2-4.4  Optimal Parallel Algorithms for Various Edge Densities

We  hope  our  careful  description  of  RANDOM-MATE  has  interested  the  reader  enough 
to  read  the  proof  of  Theorem  4.1  given  in  the  Appendix.  The  proof  is  broken  into 
three  cases: 


(1)  $m \ge n(\log n)^2$

(2)  $m \ge n(\log n)^{1/3}$

(3)  $m < n(\log n)^{1/3}$

Cases (1) and (2) apply random sampling techniques and various modified and
improved forms of RANDOM-MATE which use (m+n)/log n processors. Case (3) uses
a variant of RANDOM-MATE with a randomized conflict resolution technique similar
to the conflict resolution techniques used in our integer sorting algorithm. The
details are found in Appendix A4.

APPENDIX A1:  Probabilistic Bounds

The randomized algorithms in the preceding sections are analyzed by applying the
following probabilistic bounds on the tails of binomial and hypergeometric
distributions (see also [Feller, 80]).

Let random variable X upper bound random variable Y (and Y lower bound X)
if for all x such that $0 \le x \le 1$, $\mathrm{Prob}(X \le x) \le \mathrm{Prob}(Y \le x)$.

A1.1  Binomial Distributions

A binomial variable X with parameters n,p is the sum of n independent
Bernoulli trials, each chosen to be 1 with probability p and 0 with probability
1-p. The binomial distribution function is $\mathrm{Prob}(X \le x) = \sum_{k=0}^{x} \binom{n}{k} p^k (1-p)^{n-k}$.

The bounds of [Chernoff, 52] and [Angluin and Valiant, 79] imply

LEMMA A1.1.  $\forall \epsilon, p, n$ where $0 < p < 1$ and $0 < \epsilon < 1$,

$\mathrm{Prob}(X \le \lfloor (1-\epsilon) pn \rfloor) \le \exp(-\epsilon^2 np/2)$

$\mathrm{Prob}(X \ge \lceil (1+\epsilon) np \rceil) \le \exp(-\epsilon^2 np/3)$.

LEMMA A1.2.  [Hoeffding, 56]. Let $X_1, \ldots, X_n$ be independent binomial variables. Then
$\sum_{i=1}^{n} X_i$ is upper bounded by a binomial variable with parameters n,p with mean
$np = \sum_{i=1}^{n} \mathrm{mean}(X_i)$.
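
The tail bounds of Lemma A1.1 can be compared against exact binomial tails for small parameters. The sketch below computes the exact tails by direct summation (practical only for modest n); the function names are hypothetical and the comparison is a numerical sanity check, not a proof.

```python
import math

def binom_cdf(n, p, x):
    """Exact Prob(X <= x) for a binomial variable with parameters n, p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def chernoff_check(n, p, eps):
    """Return (lower tail, lower bound, upper tail, upper bound) as in Lemma A1.1."""
    lower_x = math.floor((1 - eps) * n * p)
    upper_x = math.ceil((1 + eps) * n * p)
    lower_tail = binom_cdf(n, p, lower_x)                 # Prob(X <= floor((1-eps)pn))
    upper_tail = 1.0 - binom_cdf(n, p, upper_x - 1)       # Prob(X >= ceil((1+eps)np))
    return (lower_tail, math.exp(-eps**2 * n * p / 2),
            upper_tail, math.exp(-eps**2 * n * p / 3))
```

For instance, `chernoff_check(200, 0.3, 0.5)` returns two exact tails that should each fall below the corresponding exponential bound.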

A1.2  Hypergeometric Distributions

Fix p,s where $0 < p < 1$ and $0 \le s \le n$. Let A be a subset of {1, ..., n} of
size np. A hypergeometric variable Y with parameters s,np,n is defined as
$Y = |S \cap A|$ where S is a random sample of s elements of {1, ..., n} chosen
without replacement.

Suppose we independently choose $s \le n$ random integers $r_1, \ldots, r_s \in \{1, \ldots, n\}$. Let
index i be conflicted if $\exists$ distinct a,b such that $r_a = r_b = i$. Let Z be
the total number of conflicted indices $i \in \{1, \ldots, n\}$.

LEMMA A1.3.  Z is upper bounded by a hypergeometric variable with parameters s,s,n.
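
The variable Z defined above is easy to compute for a concrete sequence of draws. The helper name `count_conflicted` is hypothetical; the sketch only illustrates the definition and does not attempt to verify the hypergeometric bound.

```python
from collections import Counter

def count_conflicted(draws):
    """Return Z: the number of values that occur at least twice in `draws`."""
    return sum(1 for c in Counter(draws).values() if c >= 2)
```

For the draws (3, 1, 3, 2, 1, 5) the conflicted indices are 1 and 3, so Z = 2. Since each conflicted index uses up at least two draws, Z never exceeds s/2.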
[Johnson  and  Katz,  69]  attribute  the  following  bound  to  Uhlmann. 
LEMMA A1.4. If $X$ is binomial with parameters $s,p$ and $Y$ is hypergeometric with
parameters $s,np,n$ then

$$\mathrm{Prob}(X > x) > \mathrm{Prob}(Y > x) \quad\text{for}\quad 0 < p \le \frac{nx}{(s-1)(n+1)}$$

and

$$\mathrm{Prob}(X \le x) > \mathrm{Prob}(Y \le x) \quad\text{for}\quad \frac{1 + nx/(s-1)}{n+1} \le p < 1.$$
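The comparison in Lemma A1.4 can be spot-checked numerically. The sketch below computes both distribution functions exactly and evaluates them on either side of the two thresholds, $nx/((s-1)(n+1))$ and $(1+nx/(s-1))/(n+1)$, as we read them from the lemma. The function names and the example values ($n=10$, $s=5$, $x=1$) are assumptions chosen for illustration.

```python
import math

def binom_cdf(s, p, x):
    """Exact Prob(X <= x) for X ~ Binomial(s, p)."""
    return sum(math.comb(s, k) * p**k * (1 - p)**(s - k) for k in range(x + 1))

def hypergeom_cdf(s, a, n, x):
    """Exact Prob(Y <= x) where Y = |S ∩ A|, |A| = a, and S is a sample of
    s elements of {1,...,n} drawn without replacement."""
    total = math.comb(n, s)
    return sum(math.comb(a, k) * math.comb(n - a, s - k) for k in range(x + 1)) / total

n, s, x = 10, 5, 1
low_threshold = n * x / ((s - 1) * (n + 1))      # about 0.227
high_threshold = (1 + n * x / (s - 1)) / (n + 1)  # about 0.318

# p = 0.2 lies below the first threshold: the hypergeometric puts less
# mass above x than the binomial does.
small_p = (binom_cdf(s, 0.2, x), hypergeom_cdf(s, 2, n, x))

# p = 0.4 lies above the second threshold: the hypergeometric puts less
# mass at or below x than the binomial does.
large_p = (binom_cdf(s, 0.4, x), hypergeom_cdf(s, 4, n, x))
```

In both regimes the hypergeometric variable has the lighter tail, which is the property the later proofs rely on when they replace a hypergeometric bound with a binomial one.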


APPENDIX A3: Proof of Parallel Sorting Algorithms

Proof of Lemma 3.3. Compute $(h_1,\ldots,h_n) = \text{PREFIX-SUM}(|I(1)|,\ldots,|I(n)|)$ in $O(\log n)$
time using $P$ processors by Lemma 2.1. We then set $\sigma(h_{k-1}+1),\ldots,\sigma(h_k)$ (with $h_0 = 0$) to consecutive
elements in $I(k)$ using a total of $O(\log n)$ time and $P$ processors (the required
processor assignment can easily be done by using the prefix sum computation). Then
$k_{\sigma(1)} \le \cdots \le k_{\sigma(n)}$ is a sort. □
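A sequential Python sketch of the construction in the proof of Lemma 3.3 follows. It assumes the lists $I(1), I(2), \ldots$ are given as Python lists of 1-based key indices; the function name `permutation_from_lists` and the array name `h` are hypothetical, and the parallel processor assignment is not modeled.

```python
from itertools import accumulate

def permutation_from_lists(lists):
    """lists[k-1] is I(k), the indices whose key value is k.
    Returns sigma as a list, where sigma[j-1] is the index placed at rank j."""
    sizes = [len(I) for I in lists]
    # h[k] = |I(1)| + ... + |I(k)|, with h[0] = 0 (inclusive prefix sums).
    h = [0] + list(accumulate(sizes))
    sigma = [None] * h[-1]
    for k, I in enumerate(lists):
        # Elements of I(k+1) occupy ranks h[k]+1, ..., h[k+1].
        for offset, idx in enumerate(I):
            sigma[h[k] + offset] = idx
    return sigma

# Keys k_1..k_5 = 2, 1, 2, 1, 3 give I(1) = [2, 4], I(2) = [1, 3], I(3) = [5].
sigma = permutation_from_lists([[2, 4], [1, 3], [5]])
```

Reading the keys in the order given by `sigma` yields them in nondecreasing order, since all of $I(1)$ precedes all of $I(2)$, and so on.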

Proof of Lemma 3.4. To each processor $\pi \in [P]$, we assign key indices $J(\pi) =
\{j \mid (\pi-1)\log n < j \le \min(n, \pi \log n)\}$. Let each processor $\pi$ sequentially sort the keys
$\{k_j \mid j \in J(\pi)\}$ by BUCKET-SORT in time $O(\log n)$, and so compute each list
$J_{\pi,k} = (j \in J(\pi) \mid k_j = k)$ in increasing order of indices for each key value $k \in [\log n]$. Then
for each key value $k \in [\log n]$ we compose the lists $J_{1,k},\ldots,J_{P,k}$ to form the list
$I(k)$ of indices with key value $k$. Finally, we apply Lemma 3.3 to compute the required
permutation $\sigma$ ordering the indices as they appear in $I(1),\ldots,I(P)$. The total time
is $O(\log n)$ using $P$ processors. □
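The block-wise structure of this proof can be simulated sequentially. In the sketch below each "processor" bucket-sorts one block of consecutive indices, and the per-processor lists are then concatenated by key value. The function name `blockwise_bucket_sort` and its parameters `block` (playing the role of $\log n$) and `num_values` are assumptions for illustration.

```python
def blockwise_bucket_sort(keys, block, num_values):
    """keys[j-1] is the key k_j, an integer in 1..num_values.
    Returns (I, sigma): I[k-1] is the list I(k), sigma the sorting permutation."""
    n = len(keys)
    per_processor = []
    for start in range(0, n, block):
        # One processor bucket-sorts its own block of `block` indices.
        buckets = [[] for _ in range(num_values)]
        for j in range(start, min(n, start + block)):
            buckets[keys[j] - 1].append(j + 1)  # store 1-based index
        per_processor.append(buckets)
    # Compose the lists J_{1,k}, ..., J_{P,k} into I(k) for each key value k.
    I = [[j for buckets in per_processor for j in buckets[k]]
         for k in range(num_values)]
    # Order of appearance in I(1), I(2), ... gives the permutation.
    sigma = [j for lst in I for j in lst]
    return I, sigma

I, sigma = blockwise_bucket_sort([3, 1, 2, 1, 3, 2, 1], block=3, num_values=3)
```

Because blocks are visited in index order and each bucket preserves insertion order, equal keys keep their original relative order, so the resulting sort is stable.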

Proof of Lemma 3.5. Let $k'_i = \lfloor k_i/\log n \rfloor + 1$ and let $k''_i = k_i - (k'_i - 1)\log n + 1$ for each
$i \in [n]$. We first apply Lemma 3.4 to get a sort of $k''_1,\ldots,k''_n$, yielding a permutation
$\sigma$. Then we apply Lemma 3.4 again to get a stable sort of $k'_{\sigma(1)},\ldots,k'_{\sigma(n)}$, yielding a
permutation $\sigma'$. Then $k_{\sigma'(1)} \le \cdots \le k_{\sigma'(n)}$, and hence $\sigma'$ is a sort of $k_1,\ldots,k_n$.
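The two-pass argument is a least-significant-digit radix sort with two digits. The Python sketch below assumes keys in the range $0,\ldots,b^2-1$ with $b$ standing in for $\log n$, sorts first by the low digit and then stably by the high digit, and uses 0-based positions for brevity. The names `stable_bucket_sort` and `two_pass_sort` are hypothetical.

```python
def stable_bucket_sort(order, digit, base):
    """Stably reorder the positions in `order` by digit[j], a value in 1..base."""
    buckets = [[] for _ in range(base + 1)]
    for j in order:
        buckets[digit[j]].append(j)
    return [j for bucket in buckets for j in bucket]

def two_pass_sort(keys, base):
    """Sort keys in 0..base*base-1 using two stable bucket sorts."""
    n = len(keys)
    high = [k // base + 1 for k in keys]                        # k' in 1..base
    low = [k - (h - 1) * base + 1 for k, h in zip(keys, high)]  # k'' in 1..base
    sigma = stable_bucket_sort(range(n), low, base)    # first pass: low digit
    return stable_bucket_sort(sigma, high, base)       # second pass: high digit

keys = [7, 0, 15, 8, 3, 7, 12]
order = two_pass_sort(keys, base=4)
```

The second pass must be stable: keys with equal high digit then remain in the low-digit order established by the first pass, which is what makes the final order correct.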


Proof of Lemma 3.6. If $d_0 (\log n)^2 \ge |I(k)|$ then always $N(k) \ge d_0 (\log n)^2 \ge |I(k)|$.

Else suppose $d_0 (\log n)^2 < |I(k)|$. $|I_c(k)|$ is upper bounded by a binomial variable
with parameters $n/\log n$, $|I(k)|/n$. The Chernoff bounds given in Appendix A1,
Lemma A1.1, imply $\exists c\ \forall \alpha > 1$ if $d_0 = (c\alpha)^{-1}$ then
$\mathrm{Prob}(|I_c(k)| \ge d_0 |I(k)|/\log n) \ge 1 - 1/n^{\alpha}$. Since $N(k) \ge d_0 |I_c(k)| \log n$, the probability
bounds hold as claimed.


Proof of Claim 3.1. By Lemma 3.6, with likelihood $\ge 1 - 1/n^{\alpha}$, we can assume
$N(k) \ge |I(k)|$. Let $n_k = \sum_{i \in N(k)} \text{CONFLICT}(i)$. The key observation is that on each
stage $t$, $1/\log n$ of the key indices of $I(k)$ are assigned to random positions of
the table V. Let $n_{k,t}$ be the number of indices $i \in N(k)$ where CONFLICT(i)
is set to 1 on stage $t$. Then by definition $n_k = \sum_t n_{k,t}$.

We now apply the probabilistic bounds given in Appendix A1, and we consider upper
bounds on probability variables to be over the range of probability densities from
$1/n^{\alpha}$ to $1 - 1/n^{\alpha}$. By Lemma A1.3, each $n_{k,t}$ is upper bounded by a hypergeometric
variable with parameters $|I(k)|/\log n$, $|I(k)|/\log n$, $|I(k)|$. Then Lemma A1.4 implies
each $n_{k,t}$ is upper bounded by a binomial variable with parameters $N(k)/\log n$, $1/\log n$.
Hence by (Hoeffding's inequality) Lemma A1.2, $n_k = \sum_t n_{k,t}$ is upper bounded by a
binomial variable with parameters $N(k)$, $1/\log n$. Furthermore $\sum_{k \in [D]} N(k) \le O(n)$, so
$\sum_{k \in [D]} n_k$ is upper bounded (by Hoeffding's inequality) by a binomial variable with
parameters $O(n)$, $1/\log n$. The Chernoff bounds given in Lemma A1.1 immediately imply
the claimed probabilistic bounds on $n^*$. □

Proof of Corollary 3.1. We execute the following algorithm.

Step 1  for each processor π ∈ [P] in parallel
        do for each t = 1,...,log n
           do i ← (π-1) log n + t
              randomly choose k_i ∈ [P]
           od od

Step 2  Sort k_1,...,k_n and compute I(k) = {i | k_i = k} for each key value k ∈ [P].

Comment. The sort can be done by Lemma 3.1 in time $\tilde{O}(\log n)$ using $P$ processors.

CLAIM 3.2. With high likelihood, $|I(k)| \le O(\log n)$ for each $k \in [P]$. In particular
$\exists c\ \forall \alpha > 1$ $\mathrm{Prob}(|I(k)| \le c\alpha \log n) \ge 1 - 1/n^{\alpha}$.

Proof. Each $|I(k)|$ is upper bounded by a binomial variable with parameters $n$,
$\log n/n$. Hence the claimed bounds follow from the Chernoff bounds of Lemma A1.1. □

Step 3  for each π ∈ [P] in parallel
        do let L(π) be a random permutation of the elements of I(π) od

Comment. A random permutation of $I(k)$ can easily be sequentially computed
in $O(|I(k)|)$ time by a single processor.

Step 4  Compute σ = (σ(1),...,σ(n)), the permutation of (1,...,n) which
        gives the order of appearance of the indices in L(1),...,L(P).

Comment. This can be done in $O(\log n)$ time by Lemma 3.3.

output random permutation σ.

The total time for steps 1-4 is $\tilde{O}(\log n)$ using $P$ processors. □
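The four steps can be simulated sequentially as follows. This is a minimal sketch, assuming a seeded `random.Random` in place of per-processor random number generators and a plain bucketing loop in place of the parallel sort. The function name `random_permutation` is hypothetical.

```python
import random

def random_permutation(n, P, rng):
    """Return a random permutation of 1..n built by random bucketing."""
    # Step 1: every index i draws a random key k_i in [P].
    keys = [rng.randint(1, P) for _ in range(n)]
    # Step 2: group indices by key value, I(k) = {i | k_i = k}.
    I = [[] for _ in range(P)]
    for i, k in enumerate(keys, start=1):
        I[k - 1].append(i)
    # Step 3: randomly permute each (small) bucket independently.
    L = []
    for bucket in I:
        shuffled = bucket[:]
        rng.shuffle(shuffled)
        L.append(shuffled)
    # Step 4: sigma is the order of appearance in L(1), ..., L(P).
    return [i for bucket in L for i in bucket]

sigma = random_permutation(20, 5, random.Random(1))
```

Each bucket holds $O(\log n)$ indices with high likelihood when $P = n/\log n$ (Claim 3.2), which is why a single processor can afford to shuffle its bucket sequentially.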
APPENDIX A4: Proof of Theorem 4.1

A4.1 Analysis of RANDOM-MATE
Proof of Lemma 4.1.

Let $F$ be a spanning forest of $G$. By deleting at most 1/2 the edges of
$F$ (but no active vertices), we get $F' \subseteq F$, a forest of trees of height 1, which
contains all the active vertices. On the average, at least 1/4 of the leaves of
each tree of $F'$ are mated, since their root has probability 1/2 of being
assigned female, and half of the leaves on the average will be (independently)
assigned male. Hence with probability 1/2, at least 1/8 of all active vertices
are mated. (Note: we can improve this result to show $\ge 1/4$ of all active
vertices are mated on the average.) □

Proof of Claim 4.1. We prove this by induction on the number of iterations of
the main loop. This initially holds when $R(v) = v$ for all $v \in V$. Suppose the
claim holds up to the $(t-1)$'th iteration of the main loop. Then an R-root $r$ is
merged into an R-root $r'$ by assigning $R(R(r)) \leftarrow R(r')$ only if $\exists \{v,u\} \in E$
such that $r = R(v)$ and $r' = R(u)$. Hence the claim holds after the $t$'th
iteration of the main loop. □

Proof of Lemma 4.2.

Let $R_t$ be the value of the array $R$ just before the beginning of the $t$'th
iteration of the main loop. Let an $R_t$-root $r$ be active if $\exists \{u,v\} \in E$ such that
$R_t(v) = r$ but $R_t(v) \ne R_t(u)$. Let $n_t$ be the number of distinct active $R_t$-roots on the
$t$'th iteration. Let the execution of RANDOM-MATE on the $t$'th iteration be a success
if $n_{t+1} \le \gamma n_t$ where $\gamma = 1/8$. By Lemma 4.1, the total number of
successes after $t_0$ iterations is lower bounded by a binomial variable with
parameters $t_0$, $1/2$. Observe that if we have $\log_{1/\gamma} n + 1$ successes after $t_0$
iterations, then $n_{t_0} = 0$. By the Chernoff bounds on the binomial given in Lemma A1.1
of Appendix A1, $\forall \alpha > 1\ \exists c_0$ such that if $t_0 = c_0 \log n$ then

$$\mathrm{Prob}(n_{t_0} = 0) \ge \mathrm{Prob}(\text{the number of successes after } t_0 \text{ iterations is} \ge 1 + \log_{1/\gamma} n) \ge 1 - 1/n^{\alpha}.$$

Thus with probability $\ge 1 - 1/n^{\alpha}$, after $t_0 = c_0 \log n$ iterations of RANDOM-MATE there are
no remaining active vertices. □
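The choice of the constant $c_0$ can be made explicit with the lower-tail bound of Lemma A1.1. The following derivation is a sketch under stated assumptions: $S$ denotes a binomial variable that lower bounds the number of successes, $s^*$ is the number of successes needed, $\epsilon$ is any fixed value in $(0,1)$, and $\ln$ is the natural logarithm. None of these names appear in the original proof.

```latex
\begin{align*}
S &\sim \mathrm{Binomial}(t_0, \tfrac12), \qquad t_0 = c_0 \log n, \qquad
   s^* = 1 + \log_{1/\gamma} n, \\
\mathrm{Prob}(S < s^*)
  &\le \mathrm{Prob}\bigl(S \le \lfloor (1-\epsilon)\, t_0/2 \rfloor\bigr)
   < \exp(-\epsilon^2 t_0 / 4)
   \quad \text{whenever } (1-\epsilon)\, t_0/2 \ge s^*, \\
\exp(-\epsilon^2 c_0 \log n / 4) &\le n^{-\alpha}
  \iff c_0 \ge \frac{4 \alpha \ln n}{\epsilon^2 \log n}.
\end{align*}
```

Both conditions on $c_0$ are satisfied by a sufficiently large constant that depends only on $\alpha$, $\gamma$, and the base of the logarithm, since $s^*$ and $\ln n$ are each a constant multiple of $\log n$ up to an additive 1.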

A4.2 An Optimal Algorithm for $\ge n(\log n)^2$ Edges

In this subsection we take as input a graph $G = (V,E)$ such that $V = \{1,\ldots,n\}$
and the edge set $E$ is of size $m \ge n(\log n)^2$.

Our algorithm RANDOM-MATE' will be a simple modification of RANDOM-MATE.

To avoid unnecessary notation (i.e., the use of ceiling and floor functions) we
assume without loss of generality that $\log n$ divides $m$.

We will use a total of $P = m/\log n$ processors. We will begin by sorting the
list $D(E)$ of directed edges into adjacency list arrays $E(1),\ldots,E(n)$ where $E(v)$
is an array containing the set of directed edges departing vertex $v$. Since
$|D(E)| = 2|E|$, by Theorem 3.1, this sorting can be done in $\tilde{O}(\log n)$ time using
$P$ processors.

We assign to each vertex $v \in V$ a set of $\log n$ consecutive processors
$P_v = \{(v-1)\log n + 1,\ldots,v \log n\}$. We alter the main loop of RANDOM-MATE to execute
$c_1 \log n$ times (instead of $c_0 \log n$ times) where $c_1$ is a constant to be determined
below. We also delete the original code at label merge, and substitute in its
place:

merge:  for each v ∈ V in parallel
        do for each processor π ∈ P_v in parallel
           do if E(v) ≠ ∅ then
                 choose a random edge (v,u) ∈ E(v)
                 MATE(v,u) fi
           od
        od

An edge $\{v,u\}$ is an R-loop if $R(v) = R(u)$.

Claim 4.2. $\forall \alpha \ge 1\ \exists c_1$, with probability $\ge 1 - 1/n^{\alpha}$ there are at most $m/\log n$ edges
of $E$ which are not R-loops after the $c_1 \log n$ iterations of the main loop of
RANDOM-MATE'.
Proof. Let $R_t$ be the value of the $R$ array just before the $t$'th iteration
of the main loop. Let $R_t$-root $r$ be semiactive if at least $1/\log n$ of the
edges $\{\{v,u\} \in E \mid R(v) = r\}$ are not $R_t$-loops. Let $n'_t$ be the number of semiactive
$R_t$-roots. We can assume without loss of generality that $n \ge 4$. For any semiactive
$R_t$-root $r$, with probability at least $(1 - 1/\log n)^{\log n} > 1/4$, some processor of $P_v$
chooses an edge $\{v,u\} \in E$ on step $t$ such that $R(v) = r$, $R(u) \ne r$ and we execute
MATE(v,u). Also, $\mathrm{Prob}(\text{SEX}(R(v)) = \text{male and } \text{SEX}(R(u)) = \text{female}) = 1/4$. Hence using
arguments similar to Lemma 4.1 we have with probability at least 1/2, at most $\gamma' n'_t$
semiactive $R_t$-roots are not merged on step $t$ to other $R_t$-roots where $\gamma' = 31/32$.

Let the $t$'th iteration of the main loop be successful if $n'_{t+1} \le n'_t \gamma'$. We have just
shown the $t$'th iteration is successful with probability at least 1/2. The total
number of successes after $t_1 = c_1 \log n$ iterations is lower bounded by a binomial
variable with parameters $t_1$, $1/2$. The Chernoff bounds of Lemma A1.1 imply:
$\forall \alpha > 1\ \exists c_1$ with probability $\ge 1 - 1/n^{\alpha}$, the number of successes after $t_1$ iterations
is $> \log_{1/\gamma'} n$. But $n'_{t_1} = 0$ after $1 + \log_{1/\gamma'} n$ successful iterations, and hence there
are no remaining semiactive R-roots.

After completing execution of this modified main loop, RANDOM-MATE' deletes
each R-loop edge $\{u,v\} \in E$ (where $R(u) = R(v)$) in time $O(\log n)$ using $P$
processors. Finally, RANDOM-MATE' executes the original procedure RANDOM-MATE
described in 4.1 to collapse the resulting graph to its connected components.

Hence we have

LEMMA 4.3. In time $\tilde{O}(\log n)$ using $m/\log n$ processors we can compute CC(G) for
any graph $G$ with $n$ vertices and $m \ge n(\log n)^2$ edges.

A4.3 An Optimal Algorithm for $\ge n(\log n)^{1/3}$ Edges

LEMMA 4.4. Given any graph $G = (V,E)$ with $n$ vertices and $m \ge n(\log n)^{1/3}$ edges,
we can compute CC(G) in time $\tilde{O}(\log n)$ using $(m+n)/\log n$ processors.

To  prove  this  lemma,  we  describe  another  modification  of  RANDOM-MATE  which 
we  call  RANDOM-MATE".  We  will  give  a  simplified  description  of  RANDOM-MATE".  We 
will  take  as  input  a  graph  G  =  (V,E)  with  n  vertices  m>n(log  n)  edges. 
In this case, we assign to each processor $\pi \in [m/\log n]$ a set $V_\pi$ of $(\log n)^{1/2}$
distinct consecutive vertices of $V = \{1,\ldots,n\}$. Also we again construct, by sorting
$E$, adjacency list arrays $E(1),\ldots,E(n)$.

In this case we will execute the main loop only $c_2 (\log n)^{1/4}$ iterations where
$c_2$ is a constant to be defined below. We modify the main loop by substituting in
place of the code at label merge, an assignment of $R'(v) \leftarrow R(v)$ for each vertex $v \in V$
and then the following code:

merge:  for each processor π ∈ [m/log n] in parallel do
           for each v ∈ V_π
           do for i = 1,...,(log n)^{1/4}
              do if R(v) = R'(v) and E(v) ≠ ∅ then
                    choose a random edge (v,u) ∈ E(v)
                    MATE(v,u) fi
              od
           od
        od

The test $R(v) = R'(v)$ ensures that the resulting R-trees will be of height $\le 1$ after
executing the code at label collapse. Note that the resulting main loop takes time
$O((\log n)^{3/4})$ per iteration, and so the total time is $O(\log n)$ using $m/\log n$ processors.
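The time bound follows from multiplying the loop lengths. The short calculation below is a sketch that assumes, as we read the text, $(\log n)^{1/2}$ vertices per processor, $(\log n)^{1/4}$ inner steps per vertex, $c_2 (\log n)^{1/4}$ outer iterations, and constant time for each call to MATE.

```latex
\begin{align*}
\text{time per iteration} &= (\log n)^{1/2} \cdot (\log n)^{1/4} \cdot O(1)
   = O\bigl((\log n)^{3/4}\bigr), \\
\text{total time} &= c_2 (\log n)^{1/4} \cdot O\bigl((\log n)^{3/4}\bigr)
   = O(\log n).
\end{align*}
```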

CLAIM 4.3. $\exists c_2$ such that with probability $\to 1$ as $n \to \infty$, there are at most $m/(\log n)^{1/12}$
edges of $E$ which are not R-loops after $c_2 (\log n)^{1/4}$ iterations of the main loop of
RANDOM-MATE''.


Proof of Claim 4.3. The proof is almost identical to that of Claim 4.2, except that
in this case we must redefine an $R_t$-root to be semiactive if at least $1/(\log n)^{1/12}$ of
the edges $\{\{v,u\} \in E \mid R(v) = r\}$ are not R-loops. If we let $n''_t$ be the number of (so
defined) semiactive $R_t$-roots, then again we have $\mathrm{Prob}(n''_{t+1} \le n''_t \gamma') \ge 1/2$ where again
$\gamma' = 31/32$. Hence with probability $\ge 1 - 2^{-(\log n)^{1/4}}$ no semiactive R-root exists
after $c_2 (\log n)^{1/4}$ iterations, where $c_2$ is determined by Lemma A1.1.

Claim 4.3 implies that after 12 applications of RANDOM-MATE'', the resulting
graph has only $m/\log n$ edges, and hence we can apply RANDOM-MATE, Lemma 4.1,
to completely collapse the graph and hence to determine its connected components
in $\tilde{O}(\log n)$ time using $m/\log n$ processors.
Let $G = (V,E)$ be a graph with $n$ vertices and $m \le n(\log n)^{1/3}$ edges. By
Lemma 4.4, it suffices to show in time $O(\log n)$ using $P = (m+n)/\log n$ processors
we can reduce the problem of computing CC(G) to the problem of computing the connected
components of a partially collapsed graph with $\le O(m/(\log n)^{1/3})$ vertices and $\le m$
edges. Without loss of generality we can assume $m \ge n - 1$ and $2m$ is divisible by $\log n$.

Let $D(E) = ((v_1,u_1),\ldots,(v_{2m},u_{2m}))$ be a list of the directed edges derived from
$E$. We begin by computing a random permutation $\sigma$ of $(1,\ldots,2m)$ by Corollary 3.1
in time $\tilde{O}(\log n)$ using $P$ processors. We initially assign $R(v) = v$ and
SEX(v) = female for each vertex $v \in V$. This can easily be done in $O(\log n)$ time
using $P$ processors. Then we execute the following $\log n$ steps:
for t = 1,...,log n do
    for each processor π ∈ [2m/log n] in parallel
    do MATE'(v_{σ((π-1) log n + t)}, u_{σ((π-1) log n + t)}) od
od

where we define:

procedure MATE'(v,u)
    SEX(R(v)) ← male
    if SEX(R(u)) = female then R(R(v)) ← R(u) fi
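The procedure MATE' and the surrounding loop can be simulated sequentially. The sketch below is only an approximation of the parallel algorithm: within a stage it processes the processors one after another, whereas the PRAM resolves concurrent writes in a single step. The names `mate_prime` and `run_stages`, and the dictionary representation of `R` and `SEX`, are assumptions for illustration.

```python
def mate_prime(R, SEX, v, u):
    """MATE'(v,u): mark R(v) male, then hook it under R(u) if R(u) is female."""
    SEX[R[v]] = "male"
    if SEX[R[u]] == "female":
        R[R[v]] = R[u]

def run_stages(n, directed_edges, sigma, stages):
    """directed_edges: list of (v,u) pairs; sigma: permutation of
    1..len(directed_edges) as a list; stages plays the role of log n."""
    R = {v: v for v in range(1, n + 1)}
    SEX = {v: "female" for v in range(1, n + 1)}
    processors = len(directed_edges) // stages
    for t in range(1, stages + 1):
        for pi in range(1, processors + 1):
            # Processor pi handles edge number sigma((pi-1)*stages + t).
            edge_number = sigma[(pi - 1) * stages + t - 1]
            v, u = directed_edges[edge_number - 1]
            mate_prime(R, SEX, v, u)
    return R, SEX

# Undirected edges {1,2}, {2,3}, {4,5}, each listed in both directions.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
R, SEX = run_stages(5, edges, sigma=[1, 2, 3, 4, 5, 6], stages=2)
```

Every assignment hooks a vertex under the endpoint of one of its edges, so `R[v]` always stays inside the connected component of `v`, matching the invariant of Claim 4.1.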
Note that each iteration step takes only time $O(1)$ using $P$ processors. Let a
vertex of $R(G)$ be special if either it is isolated, or has degree $\ge (\log n)^{1/3}$, or
is adjacent (by an edge of $R(G)$) to a vertex of degree $\ge (\log n)^{1/3}$.

CLAIM 4.4. The resulting partially collapsed graph $R(G)$ has $\le O(n/(\log n)^{1/3})$
vertices which are not special, and $\le m$ edges.

Proof of Claim 4.4. Let $R_t$ be the value of $R$ just before the $t$'th iteration.
Let $E_t$ be the set of directed edges chosen on the $t$'th iteration, so $D(E) = \bigcup_t E_t$.
Let $M_t$ be the number of edges $(v,u) \in E_t$ such that

(i) $v$ has degree $< (\log n)^{1/3}$ in $R_t(G)$ and
(ii) a processor $\pi$ executes MATE'(v,u) but finds $\text{SEX}(R(u)) \ne \text{female}$, so
does not assign $R(R(v)) \leftarrow R(u)$.

Observe that initially, all vertices $v \in V$ have been assigned SEX(v) = female,
and that on successive stages $t = 1,\ldots,\log n$ at most $m/(\log n - t) \le n(\log n)^{1/3}/(\log n - t)$
vertices $v \in V$ have been assigned SEX(v) = male.

We can upper bound $M_t$ by a hypergeometric variable, and then apply Lemma A1.4
to show that $M_t$ is upper bounded (for probabilities in the range from $1/n^{\alpha}$ to
$1 - 1/n^{\alpha}$) by a binomial variable with parameters $m/\log n$, $\min((\log n)^{1/3}/(\log n - t), 1)$.
Applying (Hoeffding's inequality) Lemma A1.2, we get $\sum_{t=1}^{\log n} M_t$ upper bounded by
a binomial with mean $\sum_t m/((\log n)^{2/3}(\log n - t)) \le O((m \log\log n)/(\log n)^{2/3})$
$\le O(n/(\log n)^{1/3})$ and parameters $m$, $O((\log\log n)/(\log n)^{2/3})$. Then $\sum_t M_t +
O(n/(\log n)^{1/4})$ gives an upper bound on the number of vertices of $R(G)$ which are
not special. Finally we apply the Chernoff bounds of Lemma A1.1 proving the Claim. □

To complete the reduction, we delete each isolated R-root of $R(G)$, and for each
$r \in R(V)$ with degree $< (\log n)^{1/3}$ in $R(G)$, we reassign $R(r) \leftarrow r'$ if there exists an
edge $(r,r') \in R(E)$ such that $r'$ has degree $\ge (\log n)^{1/3}$ in $R(G)$. We also
update $R'(v) \leftarrow R(R(v))$ for each $v \in V$. These final steps can easily be done in $O(\log n)$
time using $(m+n)/\log n$ processors. The resulting further collapsed graph $R'(G)$ has
$\le O(n/(\log n)^{1/3})$ vertices and $\le m$ edges. Therefore we can apply Lemma 4.4 to
completely collapse $R'(G)$ to $R''(G)$. The array $R''$ specifies the connected
components of $G$. Thus we have shown:

LEMMA 4.5. Given any graph $G$ with $n$ vertices and $m \le n(\log n)^{1/3}$ edges, we can
compute CC(G) in $O(\log n)$ time using $(m+n)/\log n$ processors.

This completes the proof of Theorem 4.1.


